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Abstract For a given matrix subspace, how can we find a basis that consists 
of low-rank matrices? This is a generalization of the sparse vector problem. It 
turns out that when the subspace is spanned by rank-1 matrices, the matrices 
can be obtained by the tensor CP decomposition. For the higher rank case, the 
situation is not as straightforward. In this work we present an algorithm based 
on a greedy process applicable to higher rank problems. Our algorithm first 
estimates the minimum rank by applying soft singular value thresholding to 
a nuclear norm relaxation, and then computes a matrix with that rank using 
the method of alternating projections. We provide local convergence results, 
and compare our algorithm with several alternative approaches. Applications 
include data compression beyond the classical truncated SVD, computing 
accurate eigenvectors of a near-multiple eigenvalue, image separation and 
graph Laplacian eigenproblems. 
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1 Introduction 

Finding a succinct representation of a given object has been one of the central 
tasks in the computer and data sciences. For vectors, sparsity (i.e., t^-norm) is 
a common measure of succinctness. For example, exploiting prior knowledge 
on sparsity of a model is now considered crucial in machine learning and 
statistics [7]. Although the naive penalization of the £°-norm easily makes the 
problem intractable, it turns out that exploiting the t 1 -norm as a regularizer 
yields a tractable convex problem and performs very well in many settings [9, 
11]. This phenomenon is strongly related to compressed sensing, which shows 
under reasonable assumptions that the ^-recovery almost always recovers a 
sparse solution for an undetermined linear system. Since matrices are more 
complex objects, one may consider different criteria on succinctness for matrices, 
namely the rank. Interestingly, a variety of concepts from sparse vector recovery 
carry over to low-rank matrix recovery, for which the nuclear norm plays a 
role analogous to the t' 1 -norm for sparse vectors [25,26]. The nuclear norm 
convex relaxation has demonstrated its theoretical and practical usefulness in 
matrix completion and other low-rank optimization tasks, and is nowadays 
accompanied by an endless array of optimization algorithms; see, e.g., [2,8,10, 
12,45,46,54], and [55] for a general overview. 

Recently, the sparse vector problem has been studied by several authors [5, 
20,52,57]. In the sparse vector problem, we are given a linear subspace S in 
R", and the task is to find the sparsest nonzero vector in S. The celebrated 
results for ^-regularization are not directly applicable, and the sparse vector 
problem is less understood. The difficulty arises from the nonzero constraints; 
a natural A -relaxation only yields the trivial zero vector. Thus the sparse 
vector problem is nonconvex in its own nature. A common approach is based 
on the hypercontractivity, that is, optimizing the ratio of two different norms. 
An algorithm that optimizes the £ 1 /£°°-ratio is studied in [20,57], and the 
algorithm in [52] works with the t 1 /t 2 -ratio. Optimization of the £ 4 /£ 2 -ratio 
was recently considered by Barak, Kelner, and Steurer [5]. A closely related 
problem is the sparse basis problem , in which we want to find a basis of S that 
minimizes the sum of the ^°-norms. In addition to imposing the sparsity of 
vectors as in the sparse vector problem, a major difficulty here lies in ensuring 
their linear independence. The recent work [60,61] is an extensive theoretical 
and practical approach to the very similar problem of sparse dictionary learning. 

In this paper, we consider the following problem, which we call the low-rank 
basis problem. Let A4 be a matrix subspace in R mxrl of dimension d. The goal 
is to 

minimize rank(Xi) + • • • + rank(A'd) 

subject to span {A-,,... ,Xd} = M. 

The low-rank basis problem generalizes the sparse basis problem. To see this, 
it suffices to identify R n with the subspace of diagonal matrices in R nxn . As a 
consequence, any result on the low-rank basis problem (1) (including relaxations 
and algorithms) will apply to the sparse basis problem with appropriate changes 
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in notation (matrix nuclear norm for diagonal matrices becomes 1-norm etc.). 
Conversely, some known results on the sparse basis problem may generalize 
to the low-rank basis problem (1). An obvious but important example of this 
logic concerns the complexity of the problem: It has been shown in Coleman 
and Pothen [15] that even if one is given the minimum possible value for 
||xi||o + • • • + ||£d||o in the sparse basis problem, it is NP-hard to find a sparse 
basis. Thus the low-rank basis problem (1) is also NP-hard in general. 

A closely related (somewhat simpler) problem is the following low-rank 
matrix problem: 

minimize rank(X) 

( 2 ) 

subject to X £ M, X ^ O. 

This problem is a matrix counterpart of the sparse vector problem. Again, (2) 
is NP-hard [15], even if M is spanned by diagonal matrices. Note that our 
problem (2) does not fall into the framework of Cai, Candes, and Shen [8], in 
which algorithms have been developed for finding a low-rank matrix X in an 
affine linear space described as A(X) = b (matrix completion problems are of 
that type). In our case we have 6 = 0, but we are of course not looking for 
the zero matrix, which is a trivial solution for (2). This requires an additional 
norm constraint, and modifications to the algorithms in [8]. 


1.1 Our contribution 

We propose an alternating direction algorithm for the low-rank basis problem 
that does (i) rank estimation, and (ii) obtains a low-rank basis. We also provide 
convergence analysis for our algorithm. Our algorithm is based on a greedy 
process, whose use we fully justify. In each greedy step we solve the low-rank 
matrix problem (2) in a certain subspace, and hence our algorithm can also 
solve the low-rank matrix problem. 

Our methods are iterative, and switch between the search of a good low-rank 
matrix and the projection on the admissible set. The second step typically 
increases the rank again. The solution would be a fixed point of such a procedure. 
We use two phases with different strategies for the first step, i.e., finding a 
low-rank matrix. 

In the first phase we find new low-rank guesses by applying the singular value 
shrinkage operator (called soft thresholding) considered in Cai, Candes, and 
Shen [8]. In combination with the subspace projection, this results in the matrix 
analog of the alternating direction algorithm proposed very recently by Qu, Sun, 
and Wright [52] for finding sparse vectors in a subspace. An additional difficulty, 
however, arises from the fact that we are required to find more than one linearly 
independent low-rank matrices in the subspace. Also note that our algorithm 
adaptively changes the thresholding parameter during its execution, which 
seems necessary for our matrix problem, although [52] fixes the thresholding 
parameter before starting their algorithm. In our experiments it turns out that 
the use of the shrinkage operator clearly outperforms alternative operations, 
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e.g., truncating singular values below some threshold, in that it finds matrices 
with correct rank quite often, but that the distance to the subspace Ad is too 
large. This is reasonable as the only fixed points of soft thresholding operator 
are either zero, or, when combined with normalization, matrices with identical 
nonzero singular values, e.g., rank-one matrices. 

As we will treat the subspace constraint as non-negotiable, we will need 
further improvement. We replace the shrinkage operator in the second phase 
by best approximation of the estimated rank (which we call hard thresholding). 
Combined with the projection onto the admissible set Ad, this then delivers 
low-rank matrices in the subspace Ad astonishingly reliably (as we shall see, 
this second phase is typically not needed when a rank-one basis exists). 

Our convergence analysis in Section 4 provides further insights into the 
behavior of the process, in particular the second phase. 


1.2 Applications 

The authors were originally motivated to consider the low-rank basis problem 
by applications in discrete mathematics [29,38,47]. It can be useful also in 
various other settings, some of which we outline below. 

Compression of SVD matrices 

Low-rank matrices arise frequently in applications and a low-rank (approximate) 
decomposition such as the SVD is often used to reduce the storage to represent 
the matrix A £ R mxn : A ss U£V T . Here £ £ R rxr where r is the rank. The 
memory requirement for storing the whole A is clearly mn, whereas U, £, V 
altogether require (m + n)r memory (we can dismiss £ by merging it into V). 
Hence, the storage reduction factor is 

(m + n)r 
mn 

so if the rank r is much smaller than min(m, n) then we achieve significant 
memory savings. 

This is all well known, but here we go one step further and try to reduce 
the memory cost for representing the matrices U, V. Note that the same idea 
of using a low-rank representation is useless here, as these matrices have 
orthonormal columns and hence the singular values are all ones. 

The idea is the following: if we matricize the columns of U (or V), those 
matrices might have a low-rank structure. More commonly, there might exist 
a nonsingular matrix W £ R rxr such that the columns of UW have low-rank 
structure when matricized. We shall see that many orthogonal matrices that 
arise in practice have this property. The question is, then, how do we find such 
W and the resulting compressed representation of U1 This problem boils down 
to the low-rank basis problem, in which Ad is the linear subspace spanned by 
the matricized columns of U. 
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To simplify the discussion here we assume m = s 2 for an integer s (otherwise, 
e.g. when m is prime, a remedy is to pad zeros to the bottom). Once we find an 
appropriate W for U G R s xr , we represent the matricization of each column 
as a low-rank (rank f) matrix UfSVf , which is represented using 2 sr memory, 
totaling to 2 srf + r 2 where r 2 is for W. Since the original U requires s 2 r 
memory with r <C s 2 , this can significantly reduce the storage if f <Cr. 

When this is employed for both U and V the overall storage reduction for 
representing A becomes 


4 srf + r 2 
mn 


( 4 ) 


For example, when m = n, r = 5m and f = 5s for 5,5 <C 1 this factor is 


4 55+ 5 2 , 


( 5 ) 


achieving a “squared” reduction factor compared with (3), which is about 25. 

Of course, we can further reduce the memory by recursively matricizing the 
columns of Up , as long as it results in data compression. 


Computing and compressing an exact eigenvector of a multiple eigenvalue 

Eigenvectors of a multiple eigenvalue are not unique, and those corresponding 
to near-multiple eigenvalues generally cannot be computed to high accuracy. 
We shall show that it is nonetheless sometimes possible to compute exact 
eigenvectors of near-multiple eigenvalues, if additional property is present that 
the eigenvectors are low-rank when matricized. This comes with the additional 
benefit of storage reduction, as discussed above. We describe more details and 
experiments in Section 5.4. 


The rank-one basis problem 

An interesting and important subcase of the low-rank basis problem is the 
rank-one basis problem ; in this problem, we are further promised that a given 
subspace A4 is spanned by rank-one matrices. Gurvits [29] first considered the 
rank-one basis problem in his work on the Edmonds problem [24]. Below let us 
explain his original motivation and connections to combinatorial optimization. 

The task of Edmonds’ problem is to find a matrix of maximum rank in 
a given matrix subspace. It is known that sampling a random element in a 
matrix subspace yields a solution with high probability, and therefore the main 
interest is to design deterministic algorithms (of course, this motivation is 
closely related to polynomial identity testing and the P vs BPP conjecture [49]). 
For some special cases, one can devise a deterministic algorithm by exploiting 
combinatorial structure of a given subspace. In particular, if a given basis 
consists of rank-one matrices (which are known), Edmonds’ problem reduces 
to linear matroid intersection [47], which immediately implies the existence of 
deterministic algorithms. 
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Gurvits [29] studied a slightly more general setting: a given subspace is only 
promised to be spanned by some rank-one matrices, which are unknown. If one 
can solve the rank-one basis problem, this setting reduces to the previous case 
using the solution of rank-one basis problem. Indeed, he conjectured that the 
rank-one basis problem is NP-hard, and designed a deterministic algorithm 
for his rank maximization problem without finding these rank-one matrices 
explicitly. We also note that Ivanyos et al. [38] investigated the Edmonds 
problem for a matrix subspace on finite fields, which is useful for max-rank 
matrix completion (see [31,32] and references therein). 

Tensor decomposition 

The rank-one basis problem has an interesting connection to tensor decomposi¬ 
tion: finding a rank-one basis for a d-dimensional matrix subspace amounts 
to finding a CP decomposition (e.g., [40, Sec. 3]) of representation rank d for 
a third-order tensor with slices Mi ,..., Md that form a basis for Ai . For the 
latter task very efficient nonconvex optimization algorithm like alternating 
least squares exist, which, however, typically come without any convergence 
certificates. An alternative, less cheap, but exact method uses simultaneous 
diagonalization, which are applicable when d < min(m, n). Applying these 
methods will often be successful when a rank-one basis exists, but fails if not. 
This tensor approach seems to have been overseen in the discrete optimization 
community so far, and we explain it in Appendix A. 

Even in general, when no rank-one basis exists, the representation of 
matrices M\,, Md in a low-rank basis can be seen as an instance of tensor 
block term decomposition [17] with d blocks. Hence, given any tensor with slices 
Mi ,..., Md (not necessarily linearly independent), our method may be used to 
obtain a certain block term decomposition with low-rank matrices constrained 
to the span of Mi,..., Md- These matrices may then be further decomposed 
into lower-rank components. In any case, the resulting sum of ranks of the basis 
provides an upper bound for the CP rank of the tensor under consideration. 

While in this paper we do not dwell on the potential applications of our 
method to specific tensor decomposition tasks as they arise in signal processing 
and data analysis (see [14,40] for overview), we conduct a numerical comparison 
with a state-of-the-art tensor algorithm for CP decomposition for synthetic 
rank-one basis problems; see Section 5.1.1. 


1.3 Outline and notation 

The rest of this paper is organized as follows. Section 2 proves that a greedy 
algorithm would solve the low-rank basis problem, if each greedy step (which 
here is NP-hard) is successful. In Section 3, we present our algorithm for the 
low-rank basis problem that follows the greedy approach. We then analyze 
convergence of phase II in our algorithm in Section 4. Experimental evaluation 
of our algorithm is illustrated in Section 5. For the special case of the rank-one 
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basis problem, we describe the alternative approach via tensor decomposition 
in Appendix A. 


Notation We summarize our notation: mxnis the size of the matrices in Ad; 
d is the dimension of the subspace Ad C R mxn ; r will denote a rank. We use 
the notation mat(x) £ R Tnxrl f or the matricization of a vector x £ R mra , and 
vec(A') £ K m ” denotes the inverse operation for X £ K mx ”. 


2 The abstract greedy algorithm for the low-rank basis problem 

As already mentioned, the low-rank basis problem (1) for a matrix subspace 
Ad is a generalization of the sparse basis problem for subspaces of R n . In [15] it 
was shown that a solution to the sparse basis problem can be in principle found 
using a greedy strategy. The same is true for (1), as we will show next. The 
corresponding greedy algorithm is given as Algorithm 1. Indeed, this algorithm 
can be understood as a greedy algorithm for an infinite matroid [51] of finite 
rank. We can prove that Algorithm 1 finds a minimizer of (1), by adapting a 
standard proof for greedy algorithms on finite matroids. Note that this fact 
does not imply that (1) is tractable, since finding Xf in the algorithm is a 
nontrivial task. 


Algorithm 1: Greedy meta-algorithm for computing a low-rank basis 

Input: Subspace M C R’ ,lXrl of dimension d. 

Output : Basis B = {Xfi..., X t } of A4. 

1 Initialize B = 0. 

2 for i = 1 , ,d do 

3 Find S .VJ of lowest possible rank such that B U { XJ } is linearly independent. 

4 B <— BU {Xf} 

5 end 


Lemma 1 Let X* ,, X f ) denote matrices constructed by the greedy Algo¬ 
rithm 1. Then for any 1 < l < d and linearly independent set {AA,..., Xf\ C 
Ad with rank(Ai) < • • • < rank(AA), it holds 

rank(AA) > rank(X*) for i = 

Proof The proof is by induction. By the greedy property, X* is a (nonzero) 
matrix of minimal possible rank in Ad, i.e., rank(X*) < rank(Xi). For l > 1, if 
rank(X^) < rank(X^), then rank(Xj) < rank(X^) for all i = 1,... ,i. But since 
one Xi must be linearly independent from X^,..., X^_ 1; this would contradict 
the choice of Xf in the greedy algorithm. □ 
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We say a linearly independent set Be = {Xi,..., Xg} C J4 is of minimal 
rank , if 

*■ A r *■ \ 

'Y rank(Xj) = min < rank(Xi) : {Xi,..., Xe} C A1 is linearly independent > . 

i=i l i =1 J 

The following theorem is immediate from the previous lemma. 


Theorem 2 Let X J”,..., XJ denote matrices constructed by the greedy Algo¬ 
rithm 1, and let Be = {Xi,..., X^} C Ad be a linearly independent set with 
rank(Xi) < • • • < rank(X^). Then Be is of minimal rank if (and hence only if) 

rank(Xj) = rank(X*) for i = 

In particular, {X (,..., Xf) is of minimal rank. 

A remarkable corollary is that the ranks of the elements in a basis of lowest 
rank are essentially unique. 

Corollary 3 The output B = {X*,..., A'*]} of Algorithm 1 solves the low-rank 
basis problem (1), that is, provides a basis for A1 of lowest possible rank. Any 
other basis of lowest rank takes the same ranks rank(X^) up to permutation. 

It is worth mentioning that even for the analogous sparse basis problem our 
results are stronger than Theorem 2.1 in [15] (which only states that B will 
be a sparsest basis). Moreover, our proof is different and considerably simpler. 
We are unaware whether the above results on the particular low-rank basis 
problem have been observed previously in this simple way. 


3 Finding low-rank bases via thresholding and projection 

In this main section of this article, we propose an algorithm that tries to 
execute the abstract greedy Algorithm 1 using iterative methods on relaxed 
formulations. 

The greedy algorithm suggests finding the matrices X \,..., Xd one after 
another, during which we monitor the linear dependence when computing 
Xe with respect to the previously computed Xi,..., X(~ j. and apply some 
restart procedure when necessary. Alternatively, one can try to find low-rank 
matrices Xi,..., Xd € Al in parallel, monitor their linear independence, and 
reinitialize the ones with largest current ranks in case the basis becomes close 
to linearly dependent. In both cases, the linear independence constraint, which 
substantially increases the hardness of the problem, is in principle ignored as 
long as possible, and shifted into a restart procedure. Therefore, we mainly 
focus on iterative methods to solve the problem (2) of finding a single low-rank 
matrix X in Al. The complete procedure for the low-rank basis problem will 
be given afterwards in Section 3.3. 
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The first algorithm we consider for solving the low-rank basis problem (2) 
alternates between soft singular value thresholding (shrinkage) and projection 
onto the subspace M ., and will be presented in the next subsection. During out¬ 
work on this paper, an analogous method for the corresponding sparse vector 
problem of minimizing ||rr||o over x £ S, ||rr ||2 = 1 has been independently 
derived and called a “nonconvex alternating direction method (ADM)” for 
a modified problem in the very recent publication [52]. This reference also 
contains a motivation for using the Euclidean norm for normalization. We have 
decided to adopt this derivation, but will use the terminology block coordinate 
descent (BCD) instead, which seems more in line with standard terminology 
regarding the actual update rules. However, as it turns out, this algorithm alone 
indeed provides good rank estimators, but very poor subspace representations. 
This is very understandable when the target rank is higher than one, since 
the fixed points of the singular value shrinkage operator explained below are 
matrices whose positive singular values are all equal, which do not happen in 
generic cases. Therefore, we turn to a second phase that uses hard singular value 
thresholding (rank truncation) for further improvement, as will be explained 
in Section 3.2. 


3.1 Phase I: Estimating a single low-rank matrix via soft thresholding 

The starting point is a further relaxation of (2): the rank function, that is, the 
number of nonzero singular values, is replaced by the matrix nuclear norm 
|| X ||*, which equals the sum of singular values. This leads to the problem 


minimize ||Aj|* 

subject to X G A4, and HA'Hf = 1. 


( 6 ) 


The relaxation from rank to nuclear norm can be motivated by the fact that 
in case a rank-one solution exists, it will certainly be recovered by solving (6). 
For higher ranks, it is less clear under which circumstances the nuclear norm 
provides an exact surrogate for the rank function under the given spherical 
constraint. For an example of a space A4 spanned by matrices of rank at most 
r for which the minimum in (6) is attained at a matrix of rank larger than r, 
consider Mi = diag(l, —y/e, —</e, 0, 0,0,0), M 2 = diag(0,1,0, y'e, i/e, 0, 0), and 
M 3 = diag(0, 0,1,0, 0, y/e, yfi). Every linear combination involving at least two 
matrices has at least rank four. So the Mi are the only matrices in their span of 
rank at most three. After normalization w.r.t. the Frobenius norm, their nuclear 
norm equals ||M i ||*/||Mj||p’ = (1 + 2 v / e)/-\A + 2e = 1 + 2-^/e + O(e). But for e 
small enough, the rank five matrix X = M\ — v / eM 2 — \feM 3 = (1,0, 0, e, e, e, e) 
has a smaller nuclear norm ||A||»/||A||i? = (1 + 4e)/Vl + 4e 2 = 1 + 4e + 0(e 2 ) 
after normalization. 
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Soft thresholding and block coordinate descent 


Nevertheless, such counterexamples are rather contrived, and we consider (6) a 
good surrogate for (2) in the generic case. The problem is still very challenging 
due to the non-convex constraint. In [52] a block coordinate descent (BCD) 
method has been proposed to minimize the (f -norm of a vector on an Euclidean 
sphere in a subspace of R n . As we explained above, this problem is a special 
case of (6), and the algorithm can be generalized as follows. 

Given a current guess X £ Ad we are looking for a matrix Y of lower-rank 
in a neighborhood if possible. For this task, we use the singular value shrinkage 
operator S T [8] : letting X = USV T be a singular value decomposition with 
A = diag(oi,..., <r rank (x)), ci > ■ ■ ■ > cr rank (x) > 0, we choose Y as 

Y = S T (X) = US t (£)V t , S t (E) =diag(o- 1 - r, ■ ■ •, c rank(x) - r)+, (7) 

where := max(x, 0) and r > 0 is a thresholding parameter. The rank of 
Y now equals the number of singular values cq larger than r. Note that even 
if the rank is not reduced the ratios of the singular values increase, since 
(<t i — t)/ ((T j — t) > <7i/<Jj for all ( i,j ) such that t < crj <Oi. Hence a successive 
application of the shift operator will eventually remove all but the dominant 
singular value(s), even if the iterates are normalized in between (without in- 
between normalization it of course removes all singular values). This effect is 
not guaranteed when simply deleting singular values below the threshold r 
without shifting the others, as it would preserve the ratios of the remaining 
singular values, and might result in no change at all if r is too small. But even 
if this hard threshold was chosen such that at least one singular value is always 
removed, we found through experiments that this does not work as well in 
combination with projections onto a subspace Ad as the soft thresholding. 

The new matrix Y in (7) will typically not lie in Ad, nor will it be normalized 
w.r.t. the Frobenius norm. Thus, introducing the orthogonal projector (w.r.t. 
the Frobenius inner product) Vm from R mxrt onto Ad, which is available given 
some basis M \,..., Md of M, we consider the fixed point iteration: 


Y=S r (X), 


Pm (Y) 

\\Pm(Y)\\f' 


( 8 ) 


The projection Vm is precomputed at the beginning and defined as MM T 
(only M is stored), where M is the orthogonal factor of the thin QR decom¬ 
position [27, Sec. 5] of the matrix [vec (Mi ),..., vec (Md)] G K mnxd ; where the 
(not necessarily low-rank) matrices M* span Ad. It is used in the following way: 


Pm(Y) = mat(MM T vec(Y)), 


(9) 


To obtain the interpretation as BCD, we recall the fact that the shrinkage 
operation provides the unique solution to the strictly convex problem 


minimize 

Y 


r\\Y\\l + l\\Y-X 


\ 2 f, 


( 10 ) 
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see [8]. Intuitively, (10) attempts to solve (6) in a neighborhood of the current 
guess X, while ignoring the constraints. The parameter r controls the balance 
between small nuclear norm and locality: the larger it is the lower rank(Y) 
becomes, but Y will be farther from X. Taking r small has the opposite effect. 
The explicit form (7) quantifies this qualitative statement, as the distance 
between X and Y is calculated as 

iia - - Yf„ =y> 2 +E «. 2 . 

Oi>T (Ji<T 

As a result, we see that the formulas (8) represent the update rules when 
applying BCD to the problem 

minimize r||F||* + * |[Y - X\\% 

subject to X G A4, and ||X|| F = 1, 

which can be seen as a penalty approach to approximately solving (6). 

Algorithm with adaptive shift r 

The considerations are summarized as Algorithm 2. At its core it is more or 
less analogous to the algorithm in [52]. However, a new feature is that the 
parameter r is chosen adaptively in every iteration. 


Algorithm 2: Phase I - Rank estimation 


Input: Orthogonal projection Vm on -M; scalars 8, maxit, changeit > 0, rtoi > 0; 

initial guess X E A4 with ||X||_p = 1, initialize r = n. 

Output: X, y, and r, where X = Vm(Y) £ -M, and Y is a matrix of low rank r 
which is close to or in A4. 


for it = 1,..., maxit do 

2 X = UZV T 

3 S = \{<Ji : CJi > Ttol} I 

4 X^Ts(X)/\\Ts{X)\\ f 

5 r = 8/y/s 

6 Y^S t {X) 
r 4— min(r, rank(y)) 

X<-V m (Y) 
X<-X/\\X\\ F 


II singular value decomposition 
// ‘noiseless’ rank 
// remove ‘noisy’ singular values 
// set singular value shift 
// shrinkage 
// new rank estimate 
// projection onto subspace 
// normalization 


10 
li end 


Terminate if r has not changed for changeit iterations. 


The choice of the singular value shift r in line 5 is made to achieve faster 
progress, and motivated by the fact that a matrix X of Frobenius norm 1 
has at least one singular value below and one above 1 /y / rank(X), unless all 
singular values are the same. Therefore, the choice of r = l/y / rank(X) would 
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always remove at least one singular value, but can also remove all but the 
largest singular value in case the latter is very dominant. To make the shift 
more efficient in case that many small singular values are present, we rely 
instead on an effective rank s which is obtained by discarding all singular 
values below a minimum threshold T to i, considered as noise (line 3). On the 
other hand, since the sought low-rank matrix may happen to have clustered 
dominant singular values, it is important to choose a less aggressive shift by 
multiplying with 0 < <5 < 1. The choice of the parameters T to i and 6 is heuristic, 
and should depend on the expected magnitudes and distribution of singular 
values for the low rank bases. In most of our experiments we used S = 0.1 and 
Ttoi = 10 -3 . With such a small value of T to i (compared to 6) the truncation 
and normalization in line 4 are optional and make only a subtle difference. 

Of course, one may consider alternative adaptive strategies such as r ~ 
||Af||*/rank(X) or r ~ cr rank ( A -)- Also r to ; and <5 may be set adaptively. 

Choice of the initial guess 

Let us remark on the choice of the initial guess. As we shall see later and can 
be easily guessed, with randomly generated initial X € A4, the output r is 
not always the rank of the lowest-rank matrix in A4. A simple way to improve 
the rank estimate is to repeat Phase I with several initial matrices, and adopt 
the one that results in the smallest rank. Another “good” initial guess seems 
to be the analogue of that suggested in [52], but this is intractable because 
the analogue here would be to initialize with a projection onto every possible 
rank-one matrix, and there are clearly infinitely many choices. We therefore 
mainly employ random initial guesses, and leave the issue of finding a good 
initial guess an open problem. 

The use as a rank estimator 

In our experiments we observed that Algorithm 2 alone is often not capable of 
finding a low-rank matrix in the subspace. Typically the two subsequences for 
Y and X in (8) produce two different numerical limits: Y tends to a low-rank 
matrix which is close to, but not in the subspace A4; by contrast, the X are 
always in the subspace, but are typically not observed to converge to a low- 
rank matrix. In fact, we can only have X = Y in (8) for rank-one matrices, in 
addition to the special case where the rank is higher but the singular values are 
all equal. Therefore, in the general case, further improvement will be necessary 
(phase II below). However, as it turns out, the rank found by the sequence 
of Y provides a surprisingly good estimate also for the sought minimal rank 
in the subspace. Moreover, the obtained X also provides the starting guess 
X = Vm for further improvement in the second phase, described next. An 
analogous statement was proven in [53] for the sparse vector problem (which 
can be regarded as a special case of ours), but the analysis there assumes the 
existence of a sparse vector in a subspace of otherwise random vectors; here 
we do not have such (or related) assumptions. In Section 4.1 we give some 



Finding a low-rank basis in a matrix subspace 


13 


qualitative explanation for why we expect this process to obtain the correct 
rank, but we leave a detailed and rigorous analysis of Algorithm 2 an open 
problem and call this preliminary procedure “Rank estimation”. 


3.2 Phase II: Finding a matrix of estimated rank via alternating projections 

We now turn to the second phase, in which we find the matrix X £ A4 such 
that rank(X) = r, the output rank of Phase I. Essentially, the soft singular 
value thresholding in Phase I is replaced by hard thresholding. 


Alternating projections 


In Phase II of our algorithm we assume that we know a rank r such that M. 
contains a matrix of rank at most r. To find that matrix, we use the method 
of alternating projections between the Euclidean (Frobenius) unit sphere in 
the subspace M. and the closed cone of matrices of rank at most r. The metric 
projection (in Frobenius norm) on this cone is given by the singular value 
truncation operator T r defined as 


TAX) = UTr(S)V T , TAX) = diag(ai,..., a r , 0,..., 0), 


where X = UXV T is an SVD of X with S = diag(oi,..., <J m i n (m,n))- The 
method of alternating projections hence reads 


Y = TAX), 


Vm{Y) 

\\Vm{Y)\\f' 


( 11 ) 


Conceptually, this iteration is the same as (8) with the soft thresholding 
operator S T replaced by the hard thresholding operator %■ Alternatively, (11) 
can be interpreted as employing BCD for the problem 


minimize IIP — A'|| p 

X,Y 

subject to X £ M, ||A||^ = 1, and rank(F) < r. 


As a result, we obtain Algorithm 3. 

The authors of the aforementioned reference [52], who proposed Algorithm 2 
for the sparse vector problem, also suggest a second phase (called “rounding”), 
which is, however, vastly different from our Algorithm 3. It is based on linear 
programming and its natural matrix analogue would be to solve 

minimize ||.X1L 

- (12) 

subject to X £ M, and tr(X T X) = 1. 


Here X is the final matrix X from Algorithm 2. In [52,53] it is shown for the 
vector case that if X is sufficiently close to a global solution of (6) then we can 
recover it exactly by solving (12). 
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Algorithm 3: Phase II Alternating projections 

Input: Orthogonal projection Vm on Al; rank r > 1; scalars maxit,tol > 0; initial 
guess -Y g At 

Output: X, Y. where X = VmX) g .VI. and Y is a matrix of rank r, and hopefully 

\\x- y\\ f <toi. 

1 while ||-Y — Y\\p > tol and it < maxit do 

2 .Y = USV T // singular value decomposition 

3 Y 4—T r (X) II best rank-r approximation 

4 X4—Vm(Y) II projection onto subspace 

5 X 4— X/\\X\\p II normalization 

6 end 


Comparison with a convex optimization solver 

Note that unlike our original problem (1) or its nuclear norm relaxation (6), (12) 
is a convex optimization problem, since the constraints are now the linear 
constraint tr(A r A) = 1 along with the restriction in the subspace X £ 
M. Nuclear norm minimization under linear constraints has been intensively 
considered in the literature, see [8,55] and references therein for seminal work. 
A natural approach is to attempt to solve (12) by some convex optimization 
solver. 

In view of this, we conduct experiments to compare our algorithm with one 
where Phase II is replaced by the cvx optimization code [28] . For the test we 
used the default tolerance parameters in cvx. We vary n while taking m = n 
and generated the matrix subspace A4 so that the exact ranks are all equal 
to five. Here and throughout, the numerical experiments were carried out in 
MATLAB version R2014a on a desktop machine with an Intel Core i7 processor 
and 16GB RAM. 

The runtime and accuracy are shown in Table 1. Here the accuracy is mea¬ 
sured as follows: letting Xi for i = 1,..., d be the computed rank-1 matrices, 
we form the mn x d matrix X = [vec(Xi),..., vec(Ad)], and compute the 
error as the subspace angle [27, Sec. 6.4.3] between X and the exact subspace 
[vec(M!),..., vec(Md)]. Observe that while both algorithms provide (approxi¬ 
mate) desired solutions, Algorithm 5 is more accurate, and much faster with the 
difference in speed increasing rapidly with the matrix size (this is for a rough 
comparison purpose: cvx is not very optimized for nuclear norm minimization). 


Table 1: Comparison between Alg. 5 and Phase II replaced with cvx. 


(a) Runtime(s). (b) Error 


n 

20 

30 

40 

50 

n 

20 

30 

40 

50 

Alg. 5 

1.4 

2.21 

3.38 

4.97 

Alg. 5 

2.9e-15 

8.0e-15 

9.5e-15 

2.1e-14 

CVX 

28.2 

186 

866 

2960 

CVX 

6.4e-09 

5.2e-10 

5.7e-10 

2.8e-10 











Finding a low-rank basis in a matrix subspace 


15 


Another approach to solving (12) is Uzawa’s algorithm as described in [8]. 
However, our experiments suggest that Uzawa’s algorithm gives poor accuracy 
(in the order of magnitude 10 -4 ), especially when the rank is not one. 

In view of these observations, in what follows we do not consider a general- 
purpose solver for convex optimization and focus on using Algorithm 3 for 
Phase II. 

3.3 A greedy algorithm for the low-rank basis problem 
Restart for linear independence 

Algorithms 2 and 3 combined often finds a low-rank matrix in M .. To mimic the 
abstract greedy Algorithm 1, this can now be done consecutively for i = 1 ,,d. 
However, to ensure linear independence among the computed matrices A,;, 
a restart procedure may be necessary. After having calculated X \,..., 
and ensured that they are linearly independent, the orthogonal projector Pg-i 
onto span{Xi,..., A^_i} is calculated. While searching for X? the norm of 
Xu — Vi-\[Xf) is monitored. If it becomes too small, it indicates (since Xg is 
normalized) that Xe is close to linearly dependent on the previously calculated 
matrices X\,... , X^_i. The process for Xp is then randomly restarted in the 
orthogonal complement of span{Xi,..., Xf_i} within Ai, which is the range 

of Qe-i = Vm — Pe- 1 - 


Algorithm 4: Restart for linear independence 
Input: Orthogonal projection Q^_i, matrix Xi and tolerance restarttol > 0. 
Output: Eventually replaced Xu. 

1 if ||Q<?_ipQ)|| P < restarttol then 

2 Replace Xu by a random element in the range of i- 

3 X t ^X t /\\Xt\\ F 

4 end 


In our implementation, we do not apply a random matrix to Qg -1 to 
obtain a random element in the range. Instead Qi-i is stored in the form of 
an orthonormal basis for which random coefficients are computed. We note 
that restarting was seldom needed in our experiments, except for the image 
separation problem in Section 5.3. A qualitative explanation is that the space 
Qi = Pm ~ Pi from which we (randomly) obtain the next initial guess is rich in 
the components of matrices that we have not yet computed, thus it is unlikely 
that an iterate X^ converges towards an element in Pg. 

Final algorithm and discussion 

The algorithm we propose for the low-rank basis problem is outlined as Algo¬ 
rithm 5. Some remarks are in order. 
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Algorithm 5: Greedy algorithm for computing a low-rank basis 
Input: Basis Mi,... M^ S R mxn for At, and integer restartit > 0. 

Output: Low-rank basis A i...., X f i of At. 

1 Assemble the projection 'P_vt • 

2 Set Qo = Vm • 

3 for l = 1,..., d do 

4 Initialize normalized Xg randomly in the range of Qg—g. 

5 Run Algorithm 2 (Phase I) on Ay, but every restartit iterations run Algorithm 4. 

6 Run Algorithm 3 (Phase II) on Xg, but every restartit iterations run 
Algorithm 4. 

7 Assemble the projection Vg on span [X ].... ,Xg} (ensure linear independence), 
s Set Q e = V M - Pi- 

9 end 


1. The algorithm is not stated very precisely as the choice of many input 
parameters are not specified. We will specify at the beginning of Section 5 
the values we used for numerical experiments. 

2. Analogously to (9), the projections Vg are in practice obtained from a QR 
decomposition of [vecfWr),..., vec(A^)]. 

3. Of course, after Algorithm 2 one can or should check whether it is necessary 
to run Algorithm 3 at all. Recall that Algorithm 2 provides a rank-estimate 
r and a new matrix Xg £ A4. There is no need to run Algorithm 3 in 
case rank(Ay) = r at that point. However, we observed that this seems 
to happen only when r = 1 (see next subsection and Section 5.1), so in 
practice it is enough to check whether r = 1. 

4. There is a principal difference between the greedy Algorithm 5, and the 
theoretical Algorithm 1 regarding the monotonicity of the found ranks. For 
instance, there is no guarantee that the first matrix Xi found by Algorithm 5 
will be of lowest possible rank in At, and it will often not happen. It seems 
to depend on the starting value (recall the discussion on the initial guess 
in Section 3.1), and an explanation may be that the soft thresholding 
procedure gets attracted by “some nearest” low-rank matrix, although a 
rigorous argument remains an open problem. In any case, finding a matrix 
of lowest rank is NP-hard as we have already explained in the introduction. 
In conclusion, one should hence not be surprised if the algorithm produces 
linearly independent matrices X\,... ,Xd for which the sequence rank(Ay) 
is not monotonically increasing. Nevertheless, at least in synthetic test, 
where the exact lowest-rank basis is exactly known and not too badly 
conditioned, the correct ranks are often recovered, albeit in a wrong order; 
see Figures 1, 3 and Section 5.1. 

5. It is interesting to note that in case the restart procedure is not activated 
in any of the loops (that is, the if-clause in Algorithm 4 always fails), 
one would have been able to find the matrices Xi ,..., X,i independently 
of each other, e.g., with a random orthogonal basis as a starting guess. 
In practice, we rarely observed that restart can be omitted, although it 
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might still be a valuable idea to run the processes independently of each 
other, and monitor the linear independence of X \,..., X ( [ as a whole. If 
some of the Xg start converging towards the same matrix, or become close 
to linearly dependent, a modified restart procedure will be required. A 
practical way for such a simultaneous restart is to use a QR decomposition 
with pivoting. It will provide a triangular matrix with decreasing diagonal 
entries, with too small entries indicating basis elements that should be 
restarted. Elements corresponding to sufficiently large diagonal elements in 
the QR decomposition can be kept. We initially tried this kind of method, 
an advantage being that the overall cost for restarts is typically lower. We 
found that it typically works well when all basis elements have the same 
rank. However, the method performed very poorly in situations where the 
ranks of the low-rank basis significantly differ from each other. Nonetheless, 
this “parallel” strategy might be worth a second look in future work. 


3.4 Complexity and typical convergence plot 

The main step in both Algorithms 2 and 3 is the computation of an SVD of 
an to x n matrix, which costs about 14mn 2 + 8 n 3 flops [27, Sec. 8.6]. This 
is done at most 2 maxit times (assuming the same in both algorithms) for d 
matrices. Since we check the need for restarting only infrequently, the cost 
there is marginal. The overall worst-case complexity of our greedy Algorithm 5 
hence depends on the choice of maxit. In most of our experiments, the inner 
loops in Algorithms 2 and 3 terminated according to the stopping criteria, long 
before maxit iterations was reached. 

Figure 1 illustrates the typical convergence behavior of Algorithm 5 for a 
problem with to = 20, n = 10, and d = 5, constructed randomly using the 
MATLAB function randn, and the exact ranks for a basis are (1, 2,3,4, 5). The 
colors correspond to £ = 1,... ,5: these are also indicated at the top of the 
figure. For each £, the evolution of the n = 10 singular values of Xi are plotted 
during the iteration (always after projection on A4). The shaded areas show 
Phase I, the unshaded areas Phase II. In both phases we used maxit = 1000 
and restai'tit = 50, moreover, Phase I was terminated when the rank did not 
change for changeit = 50 iterations (which appears to be still conservative), 
while Phase II was terminated when X( remained unchanged up to a tolerance 
of 10~ 14 in Frobenius norm. The number of SVDs is in principle equal to the 
number of single iterations, and governs the complexity, so it was used for the 
x-axis. Illustrating the fourth remark above, the ranks are not recovered in 
increasing order, but in the order (1, 5,3,2,4) (corresponding to the number of 
curves per £ not converging to zero). Repeated experiments suggest that all 
orderings of rank recovery is possible. 

Regarding the convergence of a single matrix, we make two observations in 
Figure 1. First, one can nicely see that in Phase I the singular values beyond 
a ri usually decrease until they stagnate at a certain plateau. The length of 
this plateau corresponds to the 50 iterations we have waited until accepting an 
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1 = 1 1 = 2 1 = 3 1 = 4 £ = 5 



Fig. 1: Typical convergence of Algorithm 5 for a randomly generated problem 
with m = 20, n = 10, d = 5. Each color represents one outmost loop t = 1,..., 5. 
The shaded regions indicate Phase I, and the rest is Phase II. The exact ranks 
are (1, 2, 3,4, 5), but recovered in the order (1, 5, 3,2,4). 


unchanged rank (before projection) as a correct guess. Except for the first (red) 
curve, which shows convergence towards a rank-one basis element, the error 
level of this plateau is too high to count as a low-rank matrix. Put differently, 
the (normalized) low-rank matrix Yf found by the shrinkage, on which the rank 
guess is based, is unacceptably far away from the subspace, illustrating the 
need for Phase II. Phase II seems unnecessary only when a rank-one matrix is 
targeted. 

Second, the convergence of oy+i, • • • , &n towards zero in the second phase 
is typically linear, and tends to be faster if the limit is lower in rank. We give 
an explanation for this observation in Section 4. 

The fact that the matrices are obtained in somewhat random order can 
be problematic in some cases, such as the low-rank matrix problem where 
only one matrix of lowest rank is sought. One remedy is to try multiple initial 
guesses for Phase I, and adopt the one that results in the lowest rank estimate 
r. Figure 2 is a typical illustration when three initial guesses are attempted. 
The shaded regions represent Phase I repeated three times for each greedy 
step, and the ones that were adopted are shaded darkly. Observe that now the 
matrices are obtained in the correct rank order (1,2,3,4, 5). 

Finally, Figure 3 contains an outcome of the initial experiment (without 
multiple initial guesses) with square matrices of size m = n = 20. The ranks 
are recovered in another ordering, namely (5,1, 2,3,4). One can see that the 
ratio of the number of iterations in Phases I and II is quite different, and 
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Fig. 2: Same settings as Figure 1, but with three random initial guesses. Darkly 
shaded region represent the initial guess that was adopted. For instance, for 
the first matrix (red curve) the rank was estimated to be four in the first run 
of Phase I, and one in the second and third. The second was adopted, Phase II 
was not necessary. The final rank order is the correct (1, 2,3,4, 5). 


the overall number of required SVDs is smaller. The convergence analysis in 
Section 4, which suggests a typical convergence factor gives a partial 
explanation. The main reason we give this third plot is the following interesting 
fact: the maximum rank a matrix in the generated subspace can have is 
1 + 2 + 3 + 4 + 5 = 15. Consequently, there seems to be a principal difference 
here from the previous example in that the subspace does not contain a full-rank 
matrix. This is perhaps part of why this problem seems somewhat easier in 
terms of the total number of iterations. Indeed, a close inspection of Figure 3 
reveals that for each £, we have five singular values below machine precision 
(recall that the plot shows the singular values after projection on the subspace). 


4 Convergence analysis 

Given the hardness of the low-rank basis problem, the formulation of conditions 
for success or failure of Algorithm 5 must be a challenging task. Not to mention 
the discontinuous nature of the restart procedure, a satisfying rigorous and 
global convergence result remains a potentially interesting problem for future 
work. 

Here we confine ourselves to a local convergence analysis of Phase II for a 
single matrix given a correct rank estimate r, which is an alternating projection 
method. We will consider the simplest case where at the point of interest the 
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IQ-iO 


50 100 150 


200 250 300 

# of SVDs 


350 400 


Fig. 3: m = n = 20. All matrices in the subspace are rank-deficient, and we 
observe that the number or SVDs is fewer. 


tangent space of the manifold of rank-r matrices intersects trivially with the 
tangent space of the sphere in Ad, which is arguably the simplest possible 
assumption when it comes to local convergence of the alternating projection 
method between smooth manifolds. 

Regarding Phase I, we can at least note that if A € Ad is a normalized 
rank-one matrix, then in a neighborhood of A a single step of Algorithms 2 
and 3, respectively, will give the same result (this is also true for some similar 
choices of r discussed above). Under the same assumptions as for Phase II 
this hence shows the local convergence of Phase I toward isolated normalized 
rank-one matrices in Ad, see Corollary 5 below. Since this is a local convergence 
analysis, it of course does not fully explain the strong global performance of 
both Algorithms 2 and 3 in the rank-one case as seen in Figures 1- 3 and 
Tables 2 and 3. 

In general, using the shrinkage operator cannot be guaranteed to converge 
to a local solution. We have already noted that a matrix of rank larger than two 
cannot be a fixed point of the shrinkage operator (unless all nonzero singular 
values are the same). One could construct examples where in a fixed point 
of (8) A has the same rank as Y. but generically this seems extremely unlikely. 
Therefore, the convergence of Phase I is in general a less relevant question. 
The main problem is in which situations it provides a correct rank estimate, at 
least locally. Except for the rank-one case we have no precise arguments, but 
we give a qualitative explanation at the end of Section 4.1. 
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4.1 Local convergence of Phase II 

Algorithm 3 is nothing else than the method of alternating projections for 
finding a matrix in the intersection B n lZ r of the closed sets 

B = {A £ M : \\X\\ F = 1} 


and 

U r = {X £ R mxn : rank(A) < r}. 

The local convergence analysis of the alternating projection method for 
nonconvex closed sets has made substantial progress during the last years [3, 
42,43,50], often with reference to problems involving low-rank matrices. In 
these papers one finds very abstract result in the language of variational 
analysis or differential geometry, but validating the required assumptions for 
specific situations can be complicated. The most recent work [23, Theorem 7.3], 
however, contains a very general and comprehensive result that the method of 
alternating projections is essentially locally convergent (in the sense of distance 
to the intersection) for two semialgebraic sets of which one is bounded, which 
is the case here. Moreover, convergence is R-linear and point-wise in case of an 
intrinsically transversal intersection point [23, Theorem 6.1]. A notable special 
case, for which the result has already been obtained in [3, Theorem 5.1], is 
a clean (or nontangential) intersection point. As the assumption (14) made 
below implies such a clean intersection for the problem at hand, the subsequent 
Theorem 4 follows. Nevertheless, we provide a direct and self-contained proof for 
the problem at hand based on elementary linear algebra, which may contribute 
to the understanding of alternating projection method when specifically used 
for low-rank constraints. In Remark 6 we discuss alternatives to the main 
assumption (14) in the context of the available literature in a bit more detail, 
and provide a sufficient condition for (14) in Lemma 7. 

To state the result, we assume that A* £ T r r\ B has exactly rank r. By 
semi-continuity of rank, all matrices in a neighborhood (in W mxn ) of X have 
rank at least r. Therefore, we can locally regard Algorithm 3 as an alternating 
projection between B and the smooth manifold of matrices of fixed rank r. 
Letting A'* = U^E^Vj" be a thin SVD of A* where A* contains positive singular 
values, the tangent space to that manifold at A is [34] 


T x ,lZ r 



A B 
C 0 


[K K X ] : A G 


B £ 


p rx(m-r) 


For our analysis we make the following genericity assumption: 



T x ,n r nT x ,B = {o}. 


(14) 


Here T Xr B is the tangent space of B, which is the orthogonal complement of 
A* within A4. Since T x JZ r contains A*, (14) is expressed in terms of M as 


T x ,lZ r nM = span{A*}. 


(15) 
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We remark that (14) ultimately implies that X* is an isolated point of lZ r D B , 
so actually it then holds TZ r D A4 = span{X*}. Then we have the following 
result. 


Theorem 4 Assume X* £ 1f Cl B has rank r, and that this rank is used in 
Algorithm 3. If (14) holds, then for X £ B close enough to X* the new iterate 
X„ 


i- ne w = \\Vm(t T (X))\\ f constructed by the algorithm is uniquely defined, and 


\X nP ,n X* 


<cos0 + O(||X-X*|||.)> 


||X - X*|| F 

where 9 £ (0, |] is the subspace angle between T\,B and Tx,H r , defined by 

cos 9 = max ■ 

A'GTx.B x f y f 

YET x ,lZ. r 


As a consequence, the iterates produced by Algorithm 3 are uniquely defined 
and converge to X* at a linear rate for close enough starting values. 


In accordance with our observations in Section 3.4, we also have a result 
for Phase I in the rank-one case. 


Corollary 5 If r = 1, the statement of Theorem 4 also holds for Algorithm 2. 
In fact, both algorithms produce the same iterates in some neighborhood of X*. 


Here it is essential that the value of shift r is bounded below by a fixed 
fraction of the Frobenius norm of X, as it is the case for in Algorithm 2, a 
lower bound being 6 /y/min( to, n ). 

Proof of Theorem 4 Since X and X* are in B, we can write 
X-X* =£ + 0(||X-X*||!) 


with E £ Tx,B. We partition the principal error E as 


E = [t/* U^] 


A B 
CD 


W* V, 


_LiT 


(16) 


Of course, ||i?|| F = ||A|||, + \\B\\ 2 F + ||C|||. + ||.D|| F , and due to (13), our 
assumption implies 

\\D\\ 2 f > sin 2 9- \\E\\%. (17) 

Since X* has rank r, it follows that all matrices in some neighborhood 
have a unique best rank-r approximation (by perturbation arguments for the 
singular values). In this neighborhood X new is uniquely defined. To relate ||i£|| 
to ||7^(X') — X*|| we consider the two matrices 


F = 


I CZf 1 
-CS - 1 I 


[U. u. 


_L-|T 
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and 


G = [K V^] 


I -£- l B 
E~ l B I 


both of which are orthogonal up to 0(||-E’||p): 

II f t f - I\\ F = o(\\e\\ 2 f ), || G t G - J|| F = Odl^ll^). 1 

Therefore, denoting by F, G the orthogonal polar factors of F, G, respectively, 
we also have 2 


F = F + 0(|| J B|| 2 ), G = G + 0(\\E\\%). 


One now verifies that 

FXG = FXG + 0(\\E\\ 2 F ) 


E* + A 0 
0 D 


0(\\E\\f), 


(18) 


or, since F and G are orthogonal, 


X = F 1 


E* + A 0 
0 D 


g t + o(\\e\ 


For E small enough, the best rank-r approximation of the principal part is 
obtained by deleting D. Hence, 


T r (X) =T r (F 


E '* 1 A 0 
0 D 


= F 1 


E * -f A 0 

0 0 


G t )+0(\\E\\f) 

G T + 0(\\E\\ F ). 


(19) 


To get the last equality we have used results from matrix perturbation the¬ 
ory [65] (see also [59, Sec.V.4]), which shows that under the perturbation 
G(ll-E'Hl’), the singular subspace corresponding to the r largest singular values 
of X gets perturbed by 0 ( 11 ) where gap is the smallest distance between 

the singular values of E* + A and those of D. For ||-E||f sufficiently small such 
that ||.E||f = o(o■ m in(Ai , *)), this bound is 0(||-E|||). Together with the fact that 
the perturbation in the singular values is bounded also by 0(||F1|||) (since the 
condition number of singular values is always 1), we obtain the final equality 
above. 

Therefore, taking also (18) into account, we obtain 


IITrpQ -X.\\ F = \\FT r {X)G - FX*G\\ f + 0(\\E\\ 2 f ) 


A + U* 0 


E * — B 

i 

o 

o 


-C CE~ 1 B\ 


+ 0(\\E\\ 2 f) 


A B 
C 0 


o(\\e\\ 2 f ). 


1 Here and in the following, the error constant behind 0(||£?||ir) depends mainly on the 
condition of 27*, which can be very large, but is fixed in this type of local analysis. 

2 From a polar decomposition Z T = UP one gets Z T Z — I = (Z T — U)(P+I)U T , and since 
the singular values of (P + I)U T are all at least 1, it follows that \\Z —U T \\f < \\Z T Z — I\\f- 
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Here we used 0{\\CS~ 1 B\\ 2 F ) = 0(||P||^), which holds since HA” 1 ^ can 
be regarded as a constant that does not depend on ||.E||f. Since 0(||£’||) = 
0(||A — A*||), we arrive at 


II TAX) - x t \\ F _ ^/\\Ef F - \\P\\* F + Q(||A - A,|[ 2 F ) 
|| A' — X*||f II-E'IIf + 0(||A — A*|||,) 

<vY — sin 2 9 + 0(|| A — A*|||.) 

= cos0 + 0(||A - A*||f), 


where we have used (17). Since A* £ A4, it now follows that 

||-P.m(7^(A)) — A*||f < ||77-(A) - A*||f < cos0||A - A*|| F + 0(||A - A,||f.). 

.( 21 ) 

Finally, we consider the normalization step. Recalling ||A*||f = 1, by a simple 
geometric argument on the unit sphere we obtain 


Y 


Y 


— A* 


< 


\Y — X 


cos< 


♦ IF, 


where (j> £ [0, §] such that sin<(> = ||Y — A*||f- By Taylor expansion, 
1 + 0(£ 2 ). Substituting £ = sinc/> and Y = P.m( 7)-(A)) in (22) gives 


( 22 ) 




11 A new - A*||f < ||P M (T r (A)) - A*||f + 0(||P M (r r (A)) - A*||f.). 


Using (21), we arrive at 

||A new - A*||f < cos0||A - A,||f + 0(||A - A*|||), 


completing the proof. 


□ 


From Theorem 4 we can obtain a rough estimate for the convergence factor 
that we can expect to observe in practice. Consider the “generic” case where 
the error term E in (16) is randomly distributed, that is, each element is of 
comparable absolute value. Then we have \\D\\p ss n l ||A||f, and plugging 
this into (20) gives l|7 |j-ff~^J |F < + 0(||A-A*||f,). This suggests 

that we typically expect a convergence factor ss This estimate reflects 

the experiments quite well; see Section 5.2. 

The above proof provides some insight into the behavior of Algorithm 2 in 
Phase I. In this case T r (X) in (19) is replaced by <S T (A). Provided again that 
we start with a matrix A close to A* so that ||P ||2 < t, the operation <S r (A) 
again removes the D term, emphasizing the components towards A* just like 
in Phase II as shown above. However, now the A* + A term is also affected, 
and thus Phase I stagnates where the thresholding effect in A* + A is balanced 
with the error terms that come in from the projection Vm- Then the rank 
estimate r is of correct rank rank(A'*), but neither A nor Y in Algorithm 2 is 
close to A*; reflecting the remark at the end of Section 3.1. 
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Remark 6 The conditions (14), resp. (15), allow for a simple proof but of course 
impose some restrictions, most obviously d = dim(Ad) < (m — r)(n — r - ) + 1. In 
a seminal attempt, Lewis and Malick [43] obtained the local convergence of the 
method of alternating projections between two smooth submanifolds Ad and A f 
of IT towards some X * G Ad HA/" under the condition that Tx,M+Tx,Af = R" 
(transversality). This allows for a non-trivial intersection, but imposes lower 
bounds on the dimensions, in our case d > (m — r){n — r) + 1. Andersson 
and Carlsson [3] relaxed the condition to T X ,(M fl A f) = T X ,M fl T x , A/" 
under the assumption that Ad fl J\f is a C 2 manifold. This does not impose 
restriction on the dimensions, and contains (15) as a special cases. Still these 
conditions can fail in the situation at hand (Ad a subspace of 7?. mxn , Af = TZ ,), 
for instance when Ad = Tx, TZ r . to mention one counter-example. As already 
mentioned, the recent results of Drusvyatskiy, Ioffe and Lewis [23] subsumes 
the previous results under the more general condition of intrinsic transversality. 
Another recent work, by Noll and Rondepierre [50], contains local convergence 
for alternating projections under very weak but abstract assumptions, which 
do not involve tangential conditions in first place. It may be that their result 
applies to our setting, but we have not been able to validate this. 

We conclude with a sufficient condition for (14) which might be useful in 
some very structured cases; see Proposition 9 in the appendix for an example. 
We denote by ran(A) and ran(X T ) the column and row space of a matrix A', 
respectively. 

Lemma 7 Let A'* G Ad have rank r. Assume there exists a (d—1)- dimensional 
subspace Ad C Ad complementary to span{A*} with the following property: for 
every X G Ad it holds ran(A'*) n ran(A) = 0 and ran(Aj) fl ran(X T ) = 0. 
Then (14) holds. 

Proof Consider A' = aX * + f3X G Ad with X G Ad. Let P and Q denote 
the orthogonal projections on ran(A*)- L and ran(Aj)- L , respectively. Then 
X G T x n r if and only if 

0 = PXQ t = /3PXQ t . 

It holds PXQ T ^ 0. To see this we note that PXQ T = 0 would mean 
ran(A'Q T ) C ran(A*), which by assumption implies XQ T = 0. But then 
ran(X T ) C ran(Xj), a contradiction. Hence A G T x ,LZ r if and only if /3 = 0, 
which proves the equivalent condition (15). □ 


5 Experiments 

Unless stated otherwise, the standard parameters in the subsequent experiments 
were T to i = 10~ 3 , 6 = 0.1 and changeit = 50 in Alg 2, and maxit = 1000 
and restartit = 50 for both Phase I and Phase II in Algorithm 5. The typical 
termination tolerance in Phase II was tol = 10 -14 . The restart tolerance 
restarttol in Alg. 4 was set to 10 -3 but was almost never activated. 
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5.1 Synthetic averaged examples 

We fix m = n = 20, d = 5, and a set of ranks (n,..., r^). We then randomly 
generate d random matrices M\ , ..., M r j of corresponding rank by forming 
random matrices Ug £ M mXr< and Vg £ R" xr,: w ith orthonormal columns, 
obtained from the QR factorization of random Gaussian matrices using MAT- 
LAB’s randn and setting Mg = U. To check the average rate of success of 
Algorithm 5, we run it 100 times and calculate 

— the average sum of ranks Yle= l ran k {Ye.) found by Phase I of the algorithm, 

/ d \ 1/2 

the average truncation error I \\Xg — T rc (Xg)\\p ) after Phase I, 

— the average truncation error ( Ylg=\ II W — 7^(A^)|||, j after Phase II, 

— the average iteration count (# of SVDs computed) in each Phase. 

Table 2 shows the results for some specific choices of ranks. Phase II was always 
terminated using tol = 10~ 14 , and never took the maximum 1000 iterations. 
From Table 2 we see that the ranks are sometimes estimated incorrectly, 
although this does not necessarily tarnish the final outcome. 


Table 2: Synthetic results, random initial guess. 


exact ranks 

av. sum(ranks) 

av. Phase I err (iter) 

av. Phase II err (iter) 

(i, i, i, i. i) 

5.05 

2.59e-14 (55.7) 

7.03e-15 (0.4) 

( 2 , 2 , 2 , 2 , 2 ) 

10.02 

4.04e-03 (58.4) 

1.04e-14 (9.11) 

( 1 , 2 , 3 , 4 , 5 ) 

15.05 

6.20e-03 (60.3) 

1.38e-14 (15.8) 

( 5 , 5 , 5 , 10 , 10 ) 

35.42 

1.27e-02 (64.9) 

9.37e-14 (50.1) 

( 5 , 5 , 10 , 10 , 15 ) 

44.59 

2.14e-02 (66.6) 

3.96e-05 (107) 


A simple way to improve the rank estimate is to repeat Phase I with several 
initial matrices, and adopt the one that results in the smallest rank. Table 3 
shows the results obtained in this way using five random initial guesses. 


Table 3: Synthetic results, random initial guess from subspace repeated 5 times. 


exact ranks 

av. sum(ranks) 

av. Phase I err (iter) 

av. Phase II err (iter) 

(i, i, i, i, i) 

5.00 

6.77e-15 (709) 

6.75e-15 (0.4) 

( 2 , 2 , 2 , 2 , 2 ) 

10.00 

4.04e-03 (393) 

9.57e-15 (9.0) 

(1,2,3,4,5) 

15.00 

5.82e-03 (390) 

1.37e-14 (18.5) 

( 5 , 5 , 5 , 10 , 10 ) 

35.00 

1.23e-02 (550) 

3.07e-14 (55.8) 

( 5 , 5 , 10 , 10 , 15 ) 

44.20 

2.06e-02 (829) 

8.96e-06 (227) 


We observe that the problem becomes more difficult when the ranks vary 
widely. As mentioned in Section 3.1, choosing the initial guesses as in [52] also 
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worked fine, but not evidently better than random initial guesses as in Table 3. 
From the first rows in both tables we validate once again that for the rank-one 
case, Phase II is not really necessary - Phase I is recovering a rank-one basis 
reliably. 

5.1.1 Comparison with tensor CP algorithm 

As we describe in Appendix A, if the subspace is spanned by rank-one matrices, 
then the CP decomposition (if successfully computed; the rank is a required 
input) of a tensor with slices Mjt, where Mi,..., M r j is any basis of M., provides 
a desired rank-one basis. Here we compare our algorithm with the CP-based 
approach. Specifically, we compare with the method cpd in Tensorlab [56] with 
the exact decomposition rank (the dimension d of A4) as input. By default, 
this method is based on alternating least-squares with initial guess obtained by 
an attempt of simultaneous diagonalization. When applied to a rank-one basis 
problem, cpd often gives an accurate CP decomposition with no ALS iteration. 

As seen from the tables above, given a rank-one basis problem, our Algo¬ 
rithm 5 will typically terminate after Phase I. On the other hand, since we 
assume a rank-one basis to exist (otherwise the CP approach is not necessar¬ 
ily meaningful for finding a subspace basis), we can also use the alternating 
projection algorithm from Phase II with rank one directly from random ini¬ 
tializations. In summary, we obtain three error curves: one for tensorlab, one 
for soft thresholding (Phase I) and one for alternating projection (Phase II). 
The errors are computed as in the experiments in Section 3.2 via the subspace 
angle. 

We also present the runtime to show that our algorithms are not hopelessly 
slow in the special rank-one case. Just running Phase II results in an algorithm 
faster than Algorithm 5, but it is still slower than cpd. Note that while Tensorlab 
is a highly tuned toolbox, we did not try too hard to optimize our code regarding 
the choice of parameters and memory consumption. More importantly, unlike 
cpd our algorithm does not require the rank r and is applicable even when 
r > 1. 

Growing matrix size n We first vary the matrix size n, fixing the other param¬ 
eters. The runtime and accuracy are shown in Figure 4. We observe that if the 
CP rank is known, the CP-based algorithm is both fast and accurate. 

Growing dimension d We next vary the dimension d , in particular allowing 
it to exceed n (but not n 2 ). In this case, linear dependencies among the left 
factors and right factors b^, respectively, of a rank-one basis a 1 bf,..., a d bj 
must necessarily occur. It is known that in this scenario obtaining an exact CP 
decomposition via simultaneous diagonalization, as in part attempted by cpd, 
becomes a much more subtle problem, see the references given in Section A.l. 
And indeed, we observe that for d > n the accuracy of Tensorlab deteriorates, 
while our methods do not. The runtime and accuracy for n = 10 are shown in 
Figure 5. However, this effect was less pronounced for larger m = n. 
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Fig. 4: Rank-1 basis matrices r = 1, fixed d = 10, varying m = n between 50 
and 500. The accuracy is not identical but nearly the same. Tensorlab performs 
well. 



Fig. 5: Rank-1 basis matrices r = 1, Fixed m = n = 10, varying d between 2 
and 20. Our Algorithm gives better accuracy when d > n. 


We conclude that the CP-based algorithm is recommended if (i) the basis 
matrices are known to be of rank one, and (ii) the dimension is lower than 
min(m, n). Our algorithm, on the other hand, is slower, but substantially 
different in that it does not need to know that a rank-one basis exist, but will 
detect (in Phase I) and recover it automatically. Also it seems indifferent to 
linear dependent factors in the CP model. 


5.2 Quality of the convergence estimate 

In Section 4 we analyzed the convergence of Algorithm 3 in Phase II and 
showed that, when the error term is randomly distributed the convergence 
factor would be roughly recall the remark after Theorem 4. Here we 
illustrate with experiments how accurate this estimate is. 
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In Figure 6 we plot a typical convergence of \\T r {X) — X\\p as the iterations 
proceed. We generated test problems (randomly as before) varying n on the 
left (n = 10,100) and varying r on the right (r = 2,10). The dashed lines 
indicate the convergence estimate (y/^Y after the £tli iteration. Observe that 
in both cases the estimated convergence factors reflect the actual convergence 
reasonably well, and in particular we verify the qualitative tendency that (i) 
for fixed matrix size n, the convergence is slower for larger rank r, and (ii) for 
fixed rank r, the convergence is faster for larger n. 




Iteration Iteration 


Fig. 6: Convergence of \\T r (X) — X\\p as the iterations proceed in Phase II. 
The convergence factor is faster for larger matrices when the rank is fixed 
(left), and slower for higher rank when the matrix size is fixed (right), reflecting 
Theorem 4. 


5.3 Image separation 

One well known use of the SVD is for data and image compression, although 
currently it is no longer used for the JPEG format or other modern image 
formats. It is known that most images can be compressed significantly without 
losing the visible quality by using a low-rank approximation of the matrix that 
represents the image. 

Since each grayscale image can be expressed as a matrix, here we apply 
our algorithm to a set of four incomprehensible images (shown as “mixed” in 
Figure 7) that are random linear combinations of four ‘original’ low-rank images. 
The latter were obtained from truncating four pictures (the famous ‘Lena’, along 
with photos of Audrey Hepburn, Angelina Jolie and Arnold Schwarzenegger, 
taken from the labeled faces in the wild dataset [37]) to rank exactly 15 using 
singular value decomposition. As shown in Figure 7, we can recover these 
low-rank images from the mixed ones using our Algorithm. In this experiment 
we had to decrease the minimal threshold in Phase I to r to i = 5 • 10~ 4 for 
obtaining the correct rank guess 15 for all four images. The standard choice 
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Ttoi = 10~ 3 underestimated the target rank as 14, which resulted in poorer 
approximations in the subspace after Phase II (since it does not contain a rank 
14 matrix), that is, the gap after singular value number 15 was less pronounced 
than in Fig. 7. Nevertheless, the visual quality of the recovered images was 
equally good. 

We also note that in this experiment, and only in this experiment, restarting 
(Algorithm 4) was invoked several times in Phase II, on average about 2 or 3 
times. 

Of course, separation problems like this one have been considered extensively 
in the image processing literature [1,6,67], which is known as image separation, 
and we make no claims regarding the actual usefulness of our approach in image 
processing. This example is included simply for an illustrative visualization of 
the recovery of the low-rank matrix basis by Algorithm 5. In particular, our 
approach would not work well if the matrices of the images are not low-rank 
but have gradually decaying singular values. This is the case here for the 
original images from the database, which are not of low rank. Without the 
initial truncation our algorithm did not work in this example. 


original 






Fig. 7: We start with the middle images, which are obtained as random linear 
combinations of the rank-15 images (of size 200 x 200) in the top row. We apply 
our algorithm to obtain the four images in the third row lying in the same 
subspace (they were sorted accordingly). The fourth row shows the singular 
values of the recovered images. 
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5.4 Computing exact eigenvectors of a multiple eigenvalue 

Eigenvectors of a multiple eigenvalue are not unique. For example, the identity 
matrix / has any vector as an eigenvector. However, among the many possibili¬ 
ties one might naturally wish to obtain “nice” eigenvectors: for example, the 
columns of I might be considered a good set of “nice” eigenvectors for /, as 
they require minimum storage. 

Numerically, the situation is even more complicated: a numerically stable 
algorithm computes eigenpairs (A, x) with residual Ax — Ax = 0(u||A||), where 
u is the unit roundoff. Since the eigenvector condition number is [58, 

Sec. 1.3] where gap is the distance between A and the rest of the eigenvalues, 
the accuracy of a computed eigenvector is typically 0( u jj^ )■ This indicates 
the difficulty (or impossibility in general) of computing accurate eigenvectors 
for near-multiple eigenvalues in finite precision arithmetic. The common com¬ 
promise is to compute a subspace corresponding to a cluster of eigenvalues, 
which is stable provided the cluster is well separated from the rest [4, Sec. 4.8]. 

Here we shall show nonetheless that it is sometimes possible to compute 
exact eigenvectors of (near) multiple eigenvalues, if additional property is 
present that the eigenvectors are low-rank when matricized. As we discussed in 
the introduction, this also lets us compress the memory to store the information. 
Below we illustrate how this can be done with examples. 

5-4-1 Eigenvectors of a multiple eigenvalue of a circulant matrix 

As is well known, the eigenvector matrix of a circulant matrix is the FFT 
matrix [27, Sec. 4.8]. One can easily verify that each column of an n 2 x n 2 
FFT matrix F is rank-one when matricized to n x n, exemplifying a strong 
low-rank property. 

2 2 

Let us consider a circulant matrix A £ C n xn defined by 

A = ^FAF*, (23) 

n z 


where A = diag(Ai,..., A„ 2 ). 

Suppose for the moment that one is oblivious of the circulant structure (or 
perhaps more realistically, we can think of a matrix A that is not circulant 
but has d eigenvectors consisting of columns of A; such a matrix gives similar 
results) and attempts to compute the d smallest eigenvalues of A by a standard 
algorithm such as QR. 

For the reason explained above, the numerically computed eigenvectors Xi 
obtained by MATLAB’s eig have poor accuracy. For concreteness suppose 
that Ai = A 2 = ■ • • = Xd and A d+i = A^+i-i + 1 for integers *, and we look 
for the eigenvectors corresponding to the first d eigenvalues. With n = 10 and 
d = 5, the smallest angle between Xi and the first d columns of the Fourier 
matrix were 0(1) for each i (it should be 0 if x t was exact). Nonetheless, the 
subspace spanned by the d computed eigenvectors [x\, ... ,£’d] has accuracy 
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0(u), as there is sufficient gap between Afc and A^+i- We therefore run our 
algorithm with the n x n matrix subspace 

M = span{mat(cci),..., mat(xd)}- 

Our algorithm correctly finds the rank (= 1), and finds the eigenvectors 
[xi ,..., Xd \, each of which is numerically rank-one and has 0 (u) angle with a 
column of the Fourier matrix. This is an example where by exploiting structure 
we achieve high accuracy that is otherwise impossible with a backward stable 
algorithm; another established example being the singular values for bidiagonal 
matrices [21, Ch. 5]. 

For example, we let n 2 = 20 2 and compute the smallest 5 eigenvalues of a 
circulant matrix A = diag(l + ei, 1 + € 2 ,1 + £3, 1 + £4, 1 + € 5 ,6 ,..., n 2 )F* 
where = O(10~ 10 ) was taken randomly. The matrix A therefore has a cluster 
of five eigenvalues near 1. The “exact” eigenvectors are the first five columns 
of the FFT matrix. 

Table 4: Accuracy (middle columns) and memory usage for computed eigenvec¬ 
tors of a 20 2 x 20 2 circulant matrix. 



Vl 

V 2 

V3 

VA 

V5 

memory 

eig 

4.2e-01 

1.2e+00 

1.4e+00 

1.4e+00 

1.5e+00 

O(n^) 

eig+Alg. 5 

1.2e-12 

1.2e-12 

1.2e-12 

1.2e-12 

2.7e-14 

0{n) 


Note that our algorithm recovers the exact eigenvector of a near-multiple 
eigenvalue with accuracy O(10~ 12 ). Furthermore, the storage required to store 
the eigenvectors has been reduced from 5n 2 to 5 n. 

5.4-2 Matrices with low-rank eigenvectors 

Of course, not every vector has low-rank structure when matricized. Nonetheless, 
we have observed that in many applications, the eigenvectors indeed have a 
low-rank structure that can be exploited. This observation may lead to the 
ability to deal with problems of scale much larger than previously possible. 

Circulant matrices are an important example, as we have seen above (which 
clearly includes symmetric tridiagonal, symmetric banded, etc). We have ob¬ 
served that a sparse perturbation of a circulant matrix also has such structure. 

Other examples come from graph Laplacians. We have numerically observed 
that typically the Laplacian matrix of the following graphs have eigenvectors 
(corresponding to the smallest nonzero eigenvalues) that are low-rank: binary 
tree, cycle, path graph and the wheel graph all have rank 3 irrelevant of the 
size, the lollipop graph has rank 4 (regardless of the ratio of the complete/path 
parts), and the ladder graph has rank 2 and circular ladder (rank 2) regardless 
of the size, and barbell always has rank 5. Clearly, not every graph has such 
structure: a counterexample is a complete graph. Our empirical observation is 
that sparse graphs tend to have low-rank structure in the eigenvectors. 
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Note that the low-rankness of the eigenvector depends also on the ordering 
of the vertices of the graph. 5 An ordering that seemed natural have often 
exhibited low-rank property. 

Our algorithm does not need to know a priori that a low-rank structure is 
present, as its phase I attempts to identify whether a low-rank basis exists. We 
suspect that identifying and exploiting such structure will lead to significant 
improvement in both accuracy and efficiency (both in speed and memory). 
Identifying the conditions under which such low-rank structure is present is 
left as an open problem. We expect and hope that the low-rank matrix basis 
problem will find use in applications beyond those described in this paper. 


A Finding rank-one bases via tensor decomposition 

In this appendix, we describe the rank-one basis problem as a tensor decomposition problem. 
Recall that in this problem, we are promised that the given subspace A4 is spanned by rank- 
one matrices. Thus we can apply Algorithm 3 (Phase II) with the precise rank guess directly. 
Alternatively, we can also stop after Algorithm 2 (Phase I), which in practice performs well 
(see Section 5.1). The following tensor decomposition viewpoint leads to further algorithms. 

Let Mi ,..., Md be an arbitrary basis of A4 , and let 7~ be the m x n x d tensor whose 
3-slices are Mi,..., M^. The fact that A4 possesses a rank-one basis is equivalent to the 
existence of d (and not less) triplets of vectors (a^, b^, c^) where a £ £ R m , £ R n , c £ £ R d , 
such that 

d 

M k = ^2c k£ a £ bJ, k = l,...,d (24) 

1=1 

(here c k ^ denotes the kth entry of c^). Namely, if such triplets (a £ ,b £ ,c £ ) exist, then the 
assumed linear independence of the M k automatically implies that rank-one matrices a^bj 
belong to A4. Using the outer product of vectors (denoted by o), we may express this relation 
in terms of the tensor 7~ as 

d 

T=^ a^ob|OC|. (25) 

£=1 

This type of tensor decomposition into a sum of outer products is called the CP decomposition , 
and is due to Hitchcock [36] (although the term CP decomposition appeared later). In general, 
the smallest d required for a representation of the form (25) is called the (canonical) rank of 
the tensor 7~. We refer to [40] and references therein for more details. In summary, we have 
the following trivial conclusion. 


Proposition 8 The d-dimensional matrix space A4 = span (Mi,..., Mj) possesses a rank- 
one basis if and only if the tensor 1~ whose 3-slices are the Mi,..., Md has (canonical) rank 
d. Any CP decomposition (25) of 7~ provides a rank-one basis a x b^,... ,a rf bj of A 4. 


We remark that computing the rank of a general third-order tensor is known to be 
NP-hard [33,35]. Therefore, it is NP-hard to check whether a matrix space A4 admits a 
rank-one basis. Nevertheless, we might try to find a rank-one basis by trying to calculate 
a CP decomposition (25) from linearly independent Mi,..., Md . We outline two common 
algorithms. 


3 


We thank Yuichi Yoshida for this observation. 
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A.l Simultaneous diagonalization 

If the tensor 7~ G M mXnXr is known to have rank d and d < min(m,n), it is “generi- 
cally” possible to find a CP decomposition (25) in polynomial time using simultaneous 
diagonalization [16,19,41]. 

Let us introduce the factor matrices A = [ai,..., a^] G M rnXd , B = [b^,..., b^] G R nXd , 
and C = [ci,..., c^] G R dxd .Then (24) reads 

Mfc = ADfcB T , fc = l,...,d, 

where D & = diag(c^), in which cjT denotes the kth row of C. In other words, a rank-one basis 
exists, if the Mi,..., Md can be simultaneously diagonalized. The basic idea of the algorithm 
of Leurgans, Ross, and Abel in [41] is as follows. One assumes rank(A) = rank(B) = d. Pick 
a pair (k,£), and assume that D & and D£ are invertible, and that has d distinct 

diagonal entries. Then it holds 

M k M+ A = AD k B T {B T ) + D~ 1 A + A = A 

where superscript + denotes the Moore-Penrose inverse. In other words, A contains the 
eigenvectors of M fc M^~ to distinct eigenvalues, and is essentially uniquely determined (up to 
scaling and permutation of the columns). Alternatively, for more numerical reliability, one 
can compute an eigenvalue decompositions of a linear combination of all M fc M^~ instead, 
assuming that the corresponding linear combination of D k DJ has distinct diagonal entries. 
Similarly, B can be obtained from an eigendecomposition, e.g. of M k ()+ or linear 
combinations. Finally, 

D k =A+M k (B T )+, k = l,...,d, 

which gives C. The algorithm requires the construction of Moore-Penrose inverses of matrices 
whose larger dimension is at most max(m, n). Hence, the complexity is 0(mn 2 ). 

The condition that the or a linear combination of them should have distinct 

diagonal entries is not very critical, since it holds generically, if the matrices Mi,..., Md 
are randomly drawn from A4, or, when this is not possible, are replaced by random linear 
combination of themselves. The condition rank(A) = rank(B) = d on the other hand, is 
a rather strong assumption on the rank-one basis a x b^,... , a^bj. It implies uniqueness 
of the basis, and restricts the applicability of the outlined algorithm a priori to dimension 
d < min(m, n) of Ml. There is an interesting implication on the condition (14) that we used 
for the local convergence proof of our algorithms. Theorem 4 and Corollary 5 therefore apply 
at every basis element a^bj in this setting. 

Proposition 9 Let a 1 b^,..., a^bj be a rank-one basis such that rank(A) = rank(B) = d. 
Then (14) holds at any basis element X * = Xg = a^bj\ 

Proof This follows immediately from Lemma 7 by taking Ml = span{a fc bJT : k ^ £}. □ 

De Lathauwer [16] developed the idea of simultaneous diagonalization further. His 
algorithm requires the matrix C to have full column rank, which in our case is always true 
as C must contain the basis coefficients for d linearly independent elements Mi,..., M^. 
The conditions on the full column rank of A and B can then be replaced by some weaker 
conditions, but, simply speaking, too many linear dependencies in A and B will still lead 
to a failure. A naive implementation of De Lathauwer’s algorithm in [16] seems to require 
0(n 6 ) operations (assuming m = n). 

Further progress on finding the CP decomposition algebraically under even milder 
assumptions has been made recently in [22]. It is partially based on the following observation: 
denoting by mg — vec (Mi) the n 2 x 1 vectorization of M^ (which stacks the column on top 
of each other), and defining Matr(7”) = [mi,..., m r ] G M mnXd ? we have 


Matr(T) = (B© A)C T , 


(26) 
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where B © A = [ai © bi,..., © b^] G R mnXd is the so called Khatri-Rao product of B 

and A (here © is the ordinary Kronecker product). If C (which is of full rank in our scenario) 
would be known, then A and B can be retrieved from the fact that the matricizations of the 
columns of Matr(7~)C —T = B © A must be rank-one matrices. In [22] algebraic procedures 
are proposed that find the matrix C from T. 

Either way, by computing the CP decomposition for 1~ we can, at least in practice, 
recover the rank one basis {a^bj^} in polynomial time if we know it exists. This is verified in 
our MATLAB experiments using Tensorlab’s cpd in Section 5.1.1. 


A.2 Alternating least squares 


An alternative and cheap workaround are optimization algorithms to calculate an approximate 
CP decomposition of a given third-order tensor, a notable example being alternating least 
squares (ALS), which was developed in statistics along with the CP model for data analysis [13, 
30]. In practice, they often work astonishingly well when the exact rank is provided. 

Assuming the existence of a rank-one basis, that is, rank(7~) = d , the basic ALS algorithm 
is equivalent to a block coordinate descent method applied to the function 


/(A.B.C) 


1 

2 


d 

T — ai o bi o Cl 

t= l 


2 


F 


The name of the algorithm comes from the fact that a block update consists in solving a 
least squares problem for one of the matrices A, B or C, since / is quadratic with respect 
to each of them. It is easy to derive the explicit formulas. For instance, fixing A and B, an 
optimal C with minimal Frobenius norm is found from (26) as C = Matr(7”) T (B T © A T )+. 
The updates for the other blocks look similar when using appropriate reshapes of 1~ into a 
matrix; the formulas can be found in [40]. 

The question of convergence of ALS is very delicate, and has been subject to many 
studies. As it is typical for these block coordinate type optimization methods for nonconvex 
functions, convergence can, if at all, ensured only to critical points, but regularization might 
be necessary, see [62,44,48,66,64,63] for some recent studies, and [40] in general. Practical 
implementations are typically a bit more sophisticated than the simple version outlined 
above, for instance the columns of every factor matrix should be rescaled during the process 
for more numerical stability. Also a good initialization of A, B, and C can be crucial for the 
performance. For instance one may take the leading HOSVD vectors [18] or the result of 
other methods [39] as a starting guess for ALS. 
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